Optimal Matching
   HOME

TheInfoList



OR:

Optimal matching is a
sequence analysis In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignm ...
method used in
social science Social science is one of the branches of science, devoted to the study of societies and the relationships among individuals within those societies. The term was formerly used to refer to the field of sociology, the original "science of soc ...
, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a
cohort Cohort or cohortes may refer to: * Cohort (educational group), a group of students working together through the same academic curriculum * Cohort (floating point), a set of different encodings of the same numerical value * Cohort (military unit ...
) classical tools (such as
cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
) can be used. The method was tailored to social sciences from a technique originally introduced to study molecular biology (protein or genetic) sequences (see
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alig ...
). Optimal matching uses the Needleman-Wunsch algorithm.


Algorithm

Let S = (s_1, s_2, s_3, \ldots s_T) be a sequence of states s_i belonging to a finite set of possible states. Let us denote the sequence space, i.e. the set of all possible sequences of states. Optimal matching algorithms work by defining simple operator
algebras In mathematics, an algebra over a field (often simply called an algebra) is a vector space equipped with a bilinear product. Thus, an algebra is an algebraic structure consisting of a set together with operations of multiplication and addition a ...
that manipulate sequences, i.e. a set of operators a_i: \rightarrow . In the most simple approach, a set composed of only three basic operations to transform sequences is used: * one state s is inserted in the sequence a^_ (s_1, s_2, s_3, \ldots s_T) = (s_1, s_2, s_3, \ldots, s', \ldots s_T) * one state is deleted from the sequence a^_ (s_1, s_2, s_3, \ldots s_T) = (s_1, s_3, \ldots s_T) and * a state s_1 is replaced (substituted) by state s'_1, a^_ (s_1, s_2, s_3, \ldots s_T) = (s'_1, s_2, s_3, \ldots s_T). Imagine now that a ''cost'' c(a_i) \in ^+_0 is associated to each operator. Given two sequences S_1 and S_2, the idea is to measure the ''cost'' of obtaining S_2 from S_1 using operators from the algebra. Let A= be a sequence of operators such that the application of all the operators of this sequence A to the first sequence S_1 gives the second sequence S_2: S_2 = a_1 \circ a_2 \circ \ldots \circ a_ (S_1) where a_1 \circ a_2 denotes the compound operator. To this set we associate the cost c(A) = \sum_^n c(a_i), that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences A that transform S_1 into S_2; a reasonable choice is to select the cheapest of such sequences. We thus call distance
d(S_1,S_2)= \min_A \left \
that is, the cost of the least expensive set of transformations that turn S_1 into S_2. Notice that d(S_1,S_2) is by definition nonnegative since it is the sum of positive costs, and trivially d(S_1,S_2)=0 if and only if S_1=S_2, that is there is no cost. The distance function is
symmetric Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...
if insertion and deletion costs are equal c(a^) = c(a^); the term ''indel'' cost usually refers to the common cost of insertion and deletion. Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.


Criticism

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. WuL. L. Wu. (2000)
Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect"
'' Sociological Methods & Research, 29 41-64. {{doi, 10.1177/0049124100029001003
), the main problem in the application of optimal matching is to appropriately define the costs c(a_i).


Software



is a powerful program, offering access to some of the latest developments in transition data analysis.

has implemented a package to run optimal matching analysis.
TraMineR
is an open source R-package for analyzing and visualizing states and events sequences, including optimal matching analysis.


References and notes

Data mining Statistical distance Quantitative research